---
categories:
- ""
- ""
description: "About Me"
draft: false
keywords: ""
slug: project1
title: "Life & HIV"
author: "Study Group 13"
date: "Sep 20, 2020"
output:
  html_document:
    theme: flatly
    highlight: zenburn
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: show
---
If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA's Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.
To define temperature anomalies, we need a reference (base) period, which NASA specifies as 1951-1980.
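As a quick illustration of what an anomaly is (using made-up temperatures, not the NASA series), an anomaly is simply an observation minus the mean over the base period:

```r
# Hypothetical illustration: anomalies relative to a 1951-1980-style base period
temps <- data.frame(year = 1940:1999,
                    temp = seq(13.8, 14.4, length.out = 60)) # invented temperatures (deg C)

# average temperature over the base period
base_mean <- mean(temps$temp[temps$year >= 1951 & temps$year <= 1980])

# anomaly = deviation from the base-period average
temps$anomaly <- temps$temp - base_mean

# by construction, anomalies average to ~0 over the base period itself
mean(temps$anomaly[temps$year >= 1951 & temps$year <= 1980])
```

This is why the NASA deltas hover around zero during 1951-1980 and grow positive afterwards.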
Run the code below to load the file:
weather <- read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv",
skip = 1, #skip one row as real data starts from row 2
na = "***") #missing data is coded as *** so we specify that here
weather
## # A tibble: 140 x 19
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1880 -0.54 -0.38 -0.26 -0.37 -0.11 -0.22 -0.23 -0.24 -0.26 -0.32 -0.37
## 2 1881 -0.19 -0.25 0.02 -0.02 -0.06 -0.36 -0.06 -0.03 -0.23 -0.4 -0.42
## 3 1882 0.22 0.22 0 -0.36 -0.32 -0.38 -0.37 -0.14 -0.17 -0.53 -0.32
## 4 1883 -0.59 -0.67 -0.16 -0.27 -0.32 -0.26 -0.09 -0.26 -0.33 -0.21 -0.4
## 5 1884 -0.23 -0.11 -0.65 -0.62 -0.42 -0.52 -0.48 -0.5 -0.45 -0.41 -0.48
## 6 1885 -1 -0.37 -0.21 -0.53 -0.55 -0.47 -0.39 -0.44 -0.32 -0.3 -0.28
## 7 1886 -0.68 -0.68 -0.570 -0.34 -0.34 -0.43 -0.2 -0.47 -0.34 -0.31 -0.45
## 8 1887 -1.07 -0.580 -0.36 -0.42 -0.27 -0.2 -0.23 -0.52 -0.17 -0.4 -0.19
## 9 1888 -0.53 -0.59 -0.580 -0.24 -0.16 -0.04 0.04 -0.19 -0.12 0.04 -0.03
## 10 1889 -0.31 0.35 0.07 0.15 -0.05 -0.12 -0.1 -0.16 -0.26 -0.34 -0.61
## # … with 130 more rows, and 7 more variables: Dec <dbl>, `J-D` <dbl>,
## # `D-N` <dbl>, DJF <dbl>, MAM <dbl>, JJA <dbl>, SON <dbl>
Next, we will clean the dataset: discard unwanted columns and convert it from wide to long format.
# dropping columns we don't need
weather2 <- weather %>% select(-`J-D`, -`D-N`, -DJF, -MAM, -JJA, -SON)
weather2
## # A tibble: 140 x 13
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1880 -0.54 -0.38 -0.26 -0.37 -0.11 -0.22 -0.23 -0.24 -0.26 -0.32 -0.37
## 2 1881 -0.19 -0.25 0.02 -0.02 -0.06 -0.36 -0.06 -0.03 -0.23 -0.4 -0.42
## 3 1882 0.22 0.22 0 -0.36 -0.32 -0.38 -0.37 -0.14 -0.17 -0.53 -0.32
## 4 1883 -0.59 -0.67 -0.16 -0.27 -0.32 -0.26 -0.09 -0.26 -0.33 -0.21 -0.4
## 5 1884 -0.23 -0.11 -0.65 -0.62 -0.42 -0.52 -0.48 -0.5 -0.45 -0.41 -0.48
## 6 1885 -1 -0.37 -0.21 -0.53 -0.55 -0.47 -0.39 -0.44 -0.32 -0.3 -0.28
## 7 1886 -0.68 -0.68 -0.570 -0.34 -0.34 -0.43 -0.2 -0.47 -0.34 -0.31 -0.45
## 8 1887 -1.07 -0.580 -0.36 -0.42 -0.27 -0.2 -0.23 -0.52 -0.17 -0.4 -0.19
## 9 1888 -0.53 -0.59 -0.580 -0.24 -0.16 -0.04 0.04 -0.19 -0.12 0.04 -0.03
## 10 1889 -0.31 0.35 0.07 0.15 -0.05 -0.12 -0.1 -0.16 -0.26 -0.34 -0.61
## # … with 130 more rows, and 1 more variable: Dec <dbl>
# converting dataframe from wide to long format using pivot_longer( )
tidyweather <- weather2 %>% pivot_longer(cols=2:13, names_to="month", values_to="delta")
tidyweather
## # A tibble: 1,680 x 3
## Year month delta
## <dbl> <chr> <dbl>
## 1 1880 Jan -0.54
## 2 1880 Feb -0.38
## 3 1880 Mar -0.26
## 4 1880 Apr -0.37
## 5 1880 May -0.11
## 6 1880 Jun -0.22
## 7 1880 Jul -0.23
## 8 1880 Aug -0.24
## 9 1880 Sep -0.26
## 10 1880 Oct -0.32
## # … with 1,670 more rows
Inspecting our dataframe, we now have three variables: year, month, and delta (temperature deviation).
We will plot the data as a time-series scatter plot and add a trendline. But first we use `lubridate` to build proper dates, so that delta plots chronologically.
# reformatting dates
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), month, "1")), # lubridate used here
month = month(date, label=TRUE),
Year = year(date))
# plotting temperature deviation
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point(alpha=0.5) +
geom_smooth(color="red") +
theme_bw() +
theme(
plot.title=element_text(face="bold")
) +
labs (
title = "Weather Anomalies",
subtitle = "Trends in temperature deviations over the period 1880 - 2019",
x = "Year",
y = "Temperature Deviation"
)
Next, we study the temperature increases month by month.
# plotting scatterplot and trend lines by month
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point(size=0.5, alpha=0.6)+
geom_smooth(color="red") +
facet_wrap(~month) + # to separate plots by month
theme_bw() +
theme(
plot.title=element_text(face="bold")
) +
labs (
title = "It's Getting Hot in Here",
subtitle="Trends in temperature deviations BY MONTH between 1880 and 2019",
y="Temperature Deviation",
x="Year"
)
> Overall, we see a gradual rise in the average temperature across all twelve months from 1880 to 2019, owing to global warming. This rise has been greater for winter months such as January, February, November and December than for summer months such as May, June, July and August.
> Another point worth mentioning is that the winter months show greater deviation (the points are more scattered around the trendline) than the summer months, where the points sit closer to the trendline.
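The scatter claim above can be checked numerically by comparing the standard deviation of `delta` within each month. A minimal sketch on a toy long-format frame with the same columns as `tidyweather` (the numbers are invented, not the NASA data):

```r
library(dplyr)

# toy stand-in for tidyweather: "winter" deltas made noisier than "summer" ones
toy <- data.frame(
  month = rep(c("Jan", "Jul"), each = 4),
  delta = c(-1.0, 0.6, -0.2, 0.8,   # Jan: widely scattered
             0.1, 0.2, 0.15, 0.25)  # Jul: tightly clustered
)

# spread of deltas within each month
month_spread <- toy %>%
  group_by(month) %>%
  summarise(sd_delta = sd(delta, na.rm = TRUE))

month_spread # Jan's sd is larger, matching "winter months show greater deviation"
```

Running the same `group_by(month) %>% summarise(sd(delta))` on `tidyweather` quantifies the visual impression from the faceted plot.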
Since it is sometimes useful to group data into different time periods to study historical data, we create a new data frame called comparison that groups data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))
comparison
## # A tibble: 1,668 x 5
## Year month delta date interval
## <dbl> <ord> <dbl> <date> <chr>
## 1 1881 Jan -0.19 1881-01-01 1881-1920
## 2 1881 Feb -0.25 1881-02-01 1881-1920
## 3 1881 Mar 0.02 1881-03-01 1881-1920
## 4 1881 Apr -0.02 1881-04-01 1881-1920
## 5 1881 May -0.06 1881-05-01 1881-1920
## 6 1881 Jun -0.36 1881-06-01 1881-1920
## 7 1881 Jul -0.06 1881-07-01 1881-1920
## 8 1881 Aug -0.03 1881-08-01 1881-1920
## 9 1881 Sep -0.23 1881-09-01 1881-1920
## 10 1881 Oct -0.4 1881-10-01 1881-1920
## # … with 1,658 more rows
Now that we have intervals, we can study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in.
ggplot(comparison, aes(x=delta, fill=interval)) + #fill to group and colour by different time intervals
geom_density(alpha=0.2) + #density plot with transparency set to 20%
theme_bw( ) + # theme white background and grey lines
theme(
plot.title=element_text(face="bold") # make title bold
) +
labs (
title="Distribution of Monthly Deviations",
subtitle = "Density plot grouped by different time intervals",
y = "",
x = "Temperature Deviation"
) +
guides(fill=guide_legend(title="Time Interval")) # set legend title (labs() has no legend.title argument)
So far, we have been working with monthly data for temperature deviations. Let us study average annual anomalies next.
#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>% #grouping data by Year
# creating summaries for mean delta
summarise(annual_average_delta = mean(delta, na.rm=TRUE)) # use `na.rm=TRUE` to eliminate NA (not available) values
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
geom_smooth(method="loess", color="red") + # fit the best fit line, using LOESS method
theme_bw() +
theme(
plot.title = element_text(face="bold")
) +
labs (
title = "Average Yearly Anomaly",
y = "Average Annual Temperature Deviations"
)
## Confidence Interval for delta
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
We will construct a confidence interval for the average annual `delta` since 2011, using two different methods.
formula_CI <- comparison %>%
filter(interval=="2011-present") %>% # filter for time period we're interested in
summarise(mean=mean(delta, na.rm=TRUE), # calculate summary statistics for delta
SD=sd(delta, na.rm=TRUE),
count=n(),
SE=SD/sqrt(count),
lower_CI = mean - 1.96*SE,
upper_CI = mean + 1.96*SE
)
formula_CI
## # A tibble: 1 x 6
## mean SD count SE lower_CI upper_CI
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.966 0.262 108 0.0252 0.916 1.02
The second method uses the `infer` package.
# use the infer package to construct a 95% CI for delta
library(infer)
set.seed(1234)
whatever_id_like <- comparison %>%
filter(interval=="2011-present") %>% # filtering for time period we're interested in
specify(response=delta) %>% # specifying what we're calculating CI for
generate(reps=1000, type="bootstrap") %>% # generate random samples or reps using bootstrap
calculate(stat="mean") # calculate mean
percentile_CI <- whatever_id_like %>%
get_confidence_interval(level=0.95, type="percentile")
percentile_CI
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.917 1.02
The first method calculates summary statistics and a confidence interval (CI) from all temperature deviations (delta) from 2011 to the present. The second, bootstrap method draws 1,000 random resamples (reps), calculates the mean of each, and takes the middle 95% of those means as the CI.
The 95% CI is [0.917, 1.02], which means that if we repeatedly drew samples and computed an interval in the same way for each, roughly 95% of those intervals would contain the true population mean.
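To see why the two methods land on nearly the same interval, here is a self-contained comparison on simulated data (base R only; the sample size and distribution are arbitrary stand-ins, not the weather data):

```r
set.seed(1234)
x <- rnorm(108, mean = 1, sd = 0.25) # simulated deltas, roughly like the 2011-present summary

# Method 1: formula-based 95% CI (mean +/- 1.96 standard errors)
se <- sd(x) / sqrt(length(x))
formula_ci <- mean(x) + c(-1.96, 1.96) * se

# Method 2: bootstrap percentile 95% CI (resample, take means, keep middle 95%)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

formula_ci
boot_ci # the two intervals should be close, since resampling mimics repeated sampling
```

The bootstrap makes no normality assumption, which is why it is the safer default when the sampling distribution of the statistic is unknown.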
# Import approval polls data
approval_polllist <- read_csv(here::here('data', 'approval_polllist.csv'))
# Quickly view data set.
glimpse(approval_polllist)
## Rows: 14,533
## Columns: 22
## $ president <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls…
## $ modeldate <chr> "8/29/2020", "8/29/2020", "8/29/2020", "8/29/2020…
## $ startdate <chr> "1/20/2017", "1/20/2017", "1/20/2017", "1/21/2017…
## $ enddate <chr> "1/22/2017", "1/22/2017", "1/24/2017", "1/23/2017…
## $ pollster <chr> "Gallup", "Morning Consult", "Ipsos", "Gallup", "…
## $ grade <chr> "B", "B/C", "B-", "B", "B", "C+", "B-", "B+", "B"…
## $ samplesize <dbl> 1500, 1992, 1632, 1500, 1500, 1500, 1651, 1190, 2…
## $ population <chr> "a", "rv", "a", "a", "a", "lv", "a", "rv", "a", "…
## $ weight <dbl> 0.262, 0.680, 0.153, 0.243, 0.227, 0.200, 0.142, …
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ approve <dbl> 45.0, 46.0, 42.1, 45.0, 46.0, 57.0, 42.3, 36.0, 4…
## $ disapprove <dbl> 45.0, 37.0, 45.2, 46.0, 45.0, 43.0, 45.8, 44.0, 3…
## $ adjusted_approve <dbl> 45.8, 45.3, 43.2, 45.8, 46.8, 51.6, 43.4, 37.7, 4…
## $ adjusted_disapprove <dbl> 43.6, 37.8, 43.9, 44.6, 43.6, 44.4, 44.5, 42.8, 3…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA, T…
## $ url <chr> "http://www.gallup.com/poll/201617/gallup-daily-t…
## $ poll_id <dbl> 49253, 49249, 49426, 49262, 49236, 49266, 49425, …
## $ question_id <dbl> 77265, 77261, 77599, 77274, 77248, 77278, 77598, …
## $ createddate <chr> "1/23/2017", "1/23/2017", "3/1/2017", "1/24/2017"…
## $ timestamp <chr> "13:38:37 29 Aug 2020", "13:38:37 29 Aug 2020", "…
# Data set cleaned.
approval_polllist_clean <- approval_polllist %>%
select(c(enddate, samplesize, adjusted_approve, adjusted_disapprove)) %>%
mutate(approval_difference = adjusted_approve - adjusted_disapprove)
approval_polllist_clean
## # A tibble: 14,533 x 5
## enddate samplesize adjusted_approve adjusted_disapprove approval_difference
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1/22/2017 1500 45.8 43.6 2.18
## 2 1/22/2017 1992 45.3 37.8 7.44
## 3 1/24/2017 1632 43.2 43.9 -0.659
## 4 1/23/2017 1500 45.8 44.6 1.18
## 5 1/24/2017 1500 46.8 43.6 3.18
## 6 1/24/2017 1500 51.6 44.4 7.15
## 7 1/25/2017 1651 43.4 44.5 -1.06
## 8 1/25/2017 1190 37.7 42.8 -5.08
## 9 1/25/2017 2692 41.9 36.9 4.99
## 10 1/26/2017 1678 43.7 45.1 -1.36
## # … with 14,523 more rows
The dates are given as characters, so we use `lubridate` to parse them below.
The average net approval rating for Trump, along with 95% confidence intervals, is calculated below.
# clean the approval data and compute weekly confidence intervals
Trump_Approval <- approval_polllist_clean %>%
mutate(enddate = mdy(enddate),
year = year(enddate),
week = week(enddate)
) %>%
group_by(year, week) %>%
summarise(Weekly_difference = sum(approval_difference),
avg_difference = mean(approval_difference),
se_Trump = sd(approval_difference)/sqrt(n()),
t_crit = qt(0.975, n() - 1),
lower_CIs = (avg_difference - t_crit * se_Trump),
upper_CIs = (avg_difference + t_crit * se_Trump)
)
Trump_Approval
## # A tibble: 191 x 8
## # Groups: year [4]
## year week Weekly_differen… avg_difference se_Trump t_crit lower_CIs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017 4 64.2 1.39 0.698 2.01 -0.0102
## 2 2017 5 -120. -1.79 0.517 2.00 -2.82
## 3 2017 6 -290. -3.82 0.612 1.99 -5.04
## 4 2017 7 -541. -6.44 0.615 1.99 -7.66
## 5 2017 8 -528. -6.07 0.519 1.99 -7.10
## 6 2017 9 -330. -4.23 0.508 1.99 -5.24
## 7 2017 10 -348. -3.91 0.575 1.99 -5.05
## 8 2017 11 -666. -7.83 0.599 1.99 -9.02
## 9 2017 12 -636. -7.57 0.754 1.99 -9.07
## 10 2017 13 -1060. -11.6 0.726 1.99 -13.1
## # … with 181 more rows, and 1 more variable: upper_CIs <dbl>
# The approval plot is created along with the confidence intervals.
Approval_Plot <- Trump_Approval %>%
ggplot(aes(x = week, y = avg_difference)) +
geom_ribbon(mapping = aes(ymin = lower_CIs, ymax = upper_CIs), alpha = 0.3)+
geom_line() +
geom_point() +
facet_wrap(~ year) +
geom_hline(mapping = aes(yintercept = 0), color = "orange") +
theme(legend.position = "none",
panel.background = element_rect("white", "black"),
panel.grid.major = element_line("light grey"),
panel.grid.minor = element_line("light grey")
) +
labs (
title = "Estimated Net Approval (approve-disapprove) for Donald Trump",
subtitle = "Weekly average of all polls",
y = "Average Net Approval (%)",
x = "Week of the year"
) +
expand_limits(x = c(0, 52), y = 7.5) +
NULL
Approval_Plot
As seen in the approval plot, the confidence intervals are narrower in mid-April than in mid-August. Since the standard error is proportional to the standard deviation of Trump's approval ratings, national events in mid-August, such as the BLM protests and the handling of the Covid crisis, made the public's view of Trump far more polarized. This increased the standard errors and widened the confidence intervals.
In this part, we will join a few dataframes containing data on HIV prevalence, life expectancy, and World Bank development indicators.
First, we will compile and clean the dataframes.
# load gapminder HIV and life Expectancy data, turn the data frames into one format
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv")) %>%
pivot_longer(cols = 2:34, names_to = "date", values_to = "hiv_prv")%>%
mutate(date = as.numeric(date),
hiv_prv = as.numeric(hiv_prv))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv")) %>%
pivot_longer(cols = 2:302, names_to = "date", values_to = "life_exp") %>%
mutate(date = as.numeric(date),
life_exp = as.numeric(life_exp))
# get World bank data from local due to the difficulty of connection
worldbank_data <- read_csv(here::here("data","worldbank_data.csv"))
# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels, from the World Bank API
countries <- read_csv(here::here("data","countries.csv"))
#merge the 3 data frames using left_join, keeping every World Bank row
merge_wbd <- worldbank_data %>%
left_join(., life_expectancy, by = c("country","date")) %>%
left_join(., hiv, by = c("country", "date")) %>%
mutate(region = countries$region[match(country, countries$country)])
#clean the merged data frame
merge_wbd <- merge_wbd %>%
rename(fertility_rate = SP.DYN.TFRT.IN,
GDP_cap = NY.GDP.PCAP.KD,
school_enroll = SE.PRM.NENR,
mortality_rate = SH.DYN.MORT)
After merging the data, we are to answer the following questions:
#plot the relationship between HIV prevalence and life expectancy
ggplot(merge_wbd, aes(x = life_exp, y = hiv_prv)) +
geom_point(size = 0.2) +
geom_smooth(alpha = 0.5) +
facet_wrap(~region, scales = "free") +
theme_bw() +
labs(title = "Relationship Between Life Expectancy and HIV Prevalence",
x = "Life Expectancy",
y = "HIV Prevalence")
#plot the relationship between fertility rate and GDP per capita
ggplot(merge_wbd, aes(x = fertility_rate, y = GDP_cap)) +
geom_point(size = 0.2) +
geom_smooth() +
facet_wrap(~region, scales = "free") +
theme_bw() +
labs(title = "Relationship Between Fertility Rate and GDP per Capita",
x = "Fertility Rate",
y = "GDP per Capita")
#find the regions with most missing HIV data
library("scales") # package to format data in percentage
merge_wbd %>%
select(country, region, date, hiv_prv) %>%
group_by(region) %>%
summarize(data_total = NROW(hiv_prv),
na_total = sum(is.na(hiv_prv)),
na_pct = scales::percent(round(na_total/data_total,2))) %>%
ungroup() %>%
ggplot(aes(x = na_total, y = reorder(region, na_total))) +
geom_col() +
labs(title = "Europe & Central Asia Has The Most Missing Data in HIV",
subtitle = "Number of Observations & Percentage of Total Regional Observation",
x = "Number of Missing HIV Data",
y = NULL) +
theme_bw(base_size = 12) +
ggrepel::geom_text_repel(aes(label = na_pct), nudge_x = 1, size = 3) +
NULL
#Plot the child(under 5) mortality rate trend by regions
merge_wbd %>%
select(region, country, mortality_rate, date) %>%
drop_na(., mortality_rate) %>%
group_by(region, date) %>%
summarize(mean_mort = mean(mortality_rate)) %>%
ungroup() %>%
ggplot(aes(x = date, y = mean_mort)) +
geom_line() +
facet_wrap(~region, scales = "free") +
labs(title = "Regional Child Mortality Rate Is Declining (1960-2016)",
x = NULL,
y = "Mean Mortality Rate") +
theme_bw() +
NULL
#Find the 5 most-improved countries per region, using 2000-2016 since many countries lack earlier data
new_merge_wbd <- merge_wbd %>%
select(region, country, mortality_rate, date) %>%
filter(date %in% c("2000","2016")) %>%
pivot_wider(names_from = date, values_from = mortality_rate) %>%
drop_na()
names(new_merge_wbd)[3] <- "Y2000" #rename the year columns as they are hard to reference
names(new_merge_wbd)[4] <- "Y2016"
most_improved_mortality_rate <- new_merge_wbd %>%
mutate(mort_impr = (Y2016 - Y2000)/Y2000) %>%
group_by(region) %>%
slice_min(order_by = mort_impr, n = 5) %>%
ungroup() %>%
ggplot(aes(x = mort_impr, y = reorder(country, mort_impr))) +
geom_col() +
facet_wrap(~region, scales = "free") +
labs(title = "5 Countries with the Most Improvement in Mortality Rate 2000-2016",
x = "Percentage Change of Mortality Rate",
y = NULL) +
theme_bw() +
NULL
most_improved_mortality_rate
least_improved_mortality_rate <- new_merge_wbd %>%
mutate(mort_impr = (Y2016 - Y2000)/Y2000) %>%
group_by(region) %>%
slice_max(order_by = mort_impr, n = 5) %>%
ungroup() %>%
ggplot(aes(x = mort_impr, y = reorder(country, mort_impr))) +
geom_col() +
facet_wrap(~region, scales = "free") +
labs(title = "5 Countries with the Least Improvement in Mortality Rate 2000-2016",
x = "Percentage Change of Mortality Rate",
y = NULL) +
theme_bw() +
NULL
least_improved_mortality_rate
> The under-5 mortality rate shows a decreasing trend across all regions from 1960 to 2016. To measure improvement, since some countries lack data from 1960, we compare 2016 against 2000 so as to include as many countries as possible. A negative percentage change means a country has reduced its mortality rate, whereas a positive percentage change indicates deterioration. By this measure, five countries' mortality rates deteriorated during 2000-2016.
#plot the relationship between primary school enrollment and fertility rate
merge_wbd %>%
select(region, country, school_enroll, fertility_rate) %>%
drop_na(., c(school_enroll, fertility_rate)) %>%
group_by(region, country) %>%
ggplot(aes(x = school_enroll, y = fertility_rate)) +
geom_point(size = 0.5, alpha = 0.3) +
geom_smooth() +
facet_wrap(~region, scales = "free") +
labs(title = "Negative Correlation Between Primary School Enrollment and Fertility Rate",
x = "Primary School Enrollment Rate",
y = "Fertility Rate") +
theme_bw() +
NULL
> There is a negative correlation between primary school enrollment and fertility rate in all regions.